Information on the competition can be found here.
First, to get a general sense of the dataset, let's make a correlation plot to see the relationships between the variables. Below are the correlation values of the 8 predictor variables with the count variable.
## [,1]
## season 0.163439017
## holiday -0.005392984
## workingday 0.011593866
## weather -0.128655201
## temp 0.394453645
## atemp 0.389784437
## humidity -0.317371479
## windspeed 0.101369470
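The correlation column above can be reproduced with a short sketch like the following (assuming the Kaggle training file sits at data/train.csv, the same path read later in this report):

```r
library(readr)

# Read the Kaggle training data (path assumed from the read_csv call used below).
train <- read_csv("data/train.csv")

# Correlate each of the 8 predictors with the response, count.
predictors <- c("season", "holiday", "workingday", "weather",
                "temp", "atemp", "humidity", "windspeed")
cor(train[predictors], train$count)
```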
It appears that none of the continuous variables have a strikingly clear relationship with the target variable. There appears to be useful explanatory information in some of the factor variables, but largely only in the upper range of the data (see box plots).
Then, let’s look at the spread of the response variable, ‘count’.
ggplot(data = train, aes(x = count)) + geom_density(adjust = 0.77) + labs(title = "Spread of count", x = "count")
fav_stats(train$count)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 42 | 145 | 284 | 977 | 191.5741 | 181.1445 | 10886 | 0 |
The density plot is fairly right-skewed, with most of the values concentrated at the lower end of the range (the median of 145 sits well below the mean of about 192).
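A right-skewed response like this is one reason the model below is fit on log(count) rather than count itself; a quick hypothetical sketch of the transformed spread (assuming train is loaded as elsewhere in the report):

```r
library(ggplot2)

# Sketch: the log transform pulls in the long right tail of count,
# giving a more symmetric distribution (count is strictly positive here).
ggplot(train, aes(x = log(count))) +
  geom_density(adjust = 0.77) +
  labs(title = "Spread of log(count)", x = "log(count)")
```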
We can plot the relationship between our response variable and three selected predictors to see how bike rentals are affected by temperature, humidity, and whether the day in question is a holiday.
train$holiday1<-0
train$holiday1[which(train$holiday == 0)] <- 'Not Holiday'
train$holiday1[which(train$holiday == 1)] <- 'Holiday'
train$holiday <- as.factor(train$holiday)
p <- plot_ly(train, x = ~temp, y = ~humidity, z = ~count, color= ~holiday1, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Temperature (Celsius)'),
yaxis = list(title = 'Humidity'),
zaxis = list(title = 'Count')))
p
We pick a model that uses windspeed, temperature, and humidity as the three predictors of count. Justification: in the correlation output above, temp, atemp, and humidity had the strongest apparent relationships with count. Windspeed was used instead of atemp (the "feels like" temperature) to avoid the obvious collinearity between temp and atemp.
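The temp/atemp collinearity is easy to confirm directly (a sketch; assumes train is loaded as elsewhere in the report, and the correlation is typically very close to 1):

```r
# temp and atemp (the "feels like" temperature) track each other closely,
# so including both predictors would introduce strong collinearity.
cor(train$temp, train$atemp)
```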
The following fits the model and then applies it to the test set via predict():
train <- read_csv("data/train.csv")
model1<- lm(log(count)~ windspeed+temp+humidity, data=train)
summary(model1)
##
## Call:
## lm(formula = log(count) ~ windspeed + temp + humidity, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5604 -0.6476 0.2639 0.8750 3.2963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.511861 0.065391 68.998 < 2e-16 ***
## windspeed 0.005989 0.001616 3.706 0.000212 ***
## temp 0.067915 0.001609 42.216 < 2e-16 ***
## humidity -0.022783 0.000687 -33.162 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.304 on 10882 degrees of freedom
## Multiple R-squared: 0.2334, Adjusted R-squared: 0.2332
## F-statistic: 1104 on 3 and 10882 DF, p-value: < 2.2e-16
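Because the response is log(count), each coefficient acts multiplicatively on the expected count once exponentiated. For example, using the temp estimate from the summary above:

```r
# Multiplicative effect on expected count of a 1 degree C increase in temp,
# holding windspeed and humidity fixed (estimate taken from the summary above).
exp(0.067915)   # roughly 1.07, i.e. about a 7% increase per degree
```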
mplot(model1)[1:2]
Test_predict <- predict(model1, newdata = test)
# Back-transform from the log scale and attach the submission datetimes.
Test_predict_table <- tibble(Test_predict) %>%
mutate(count = exp(Test_predict)) %>%
mutate(datetime = sample_submission$datetime) %>%
select(datetime, count)
All of the predictors in this model are statistically significant. That said, most variants of this model (including log and square-root transformations of the response) violate the equal-variance and normality-of-errors conditions.
The following code is intended for a re-submission using cross-validation.
set.seed(3400)
k <- 5                # number of folds
nrow(train) / k       # observations per fold
## [1] 2177.2
Fold_train <- train %>%
sample_frac(1) %>%                          # shuffle the rows
mutate(fold = rep(1:k, len = nrow(train)))  # assign fold labels 1..k
"indices <- sample(nrow(train), nrow(train), replace=FALSE)
splits<-split(indices, ceiling(seq_along(indices)/2177))
splits<-as.data.frame(splits)
d1a <- train[splits$X1,]
d1b <- train[splits$X2,]
d1c <- train[splits$X3,]
d1d <- train[splits$X4,]
d1e <- train[splits$X5,]
folds <- list[d1a,d1b,d1c,d1d,d1e]
rmse1<-0
rmse2<-0
rmse3<-0
rmse4<-0
rmse5<-0
rmses <- c(rmse1,rmse2,rmse3,rmse4,rmse5)"
## [1] "indices <- sample(nrow(train), nrow(train), replace=FALSE)\nsplits<-split(indices, ceiling(seq_along(indices)/2177))\nsplits<-as.data.frame(splits)\nd1a <- train[splits$X1,]\nd1b <- train[splits$X2,]\nd1c <- train[splits$X3,]\nd1d <- train[splits$X4,]\nd1e <- train[splits$X5,]\nfolds <- list[d1a,d1b,d1c,d1d,d1e]\nrmse1<-0\nrmse2<-0\nrmse3<-0\nrmse4<-0\nrmse5<-0\nrmses <- c(rmse1,rmse2,rmse3,rmse4,rmse5)"
"for (i in 1:5){
rmses[i-1] <- lm(count~temp~workingday+weather, data = subset(Fold_train, fold = i)) %>%
predict(subset(Fold_train, fold != i)) %>%
`-` (subset(Fold_train, fold != i)$count) %>%
`^`(2) %>%
sum %>%
`/` (subset(Fold_train, fold != i)$count) %>%
sqrt
}"
## [1] "for (i in 1:5){\n rmses[i-1] <- lm(count~temp~workingday+weather, data = subset(Fold_train, fold = i)) %>% \n predict(subset(Fold_train, fold != i)) %>% \n `-` (subset(Fold_train, fold != i)$count) %>% \n `^`(2) %>% \n sum %>% \n `/` (subset(Fold_train, fold != i)$count) %>% \n sqrt\n}"
write.csv(Test_predict_table, "submission.csv", row.names = F)
Final checks:
- Every plot has all labels (including units).
- Code chunks are hidden in the output.
- Our names, date, etc. are all at the top.